List of composition models for two words
TODO: Yang et al. 2014 A general formula for all models that compose a two-word phrase is: p = f(u, v; \theta_R) This is the general formula of most, if not all, existing composition models for two words. We can assume a single vector space without loss of generality (see also Vector). Simple operation models Vector addition f(u,v) = u+v Element-wise multiplication f(u,v) = u \odot v = u_1v_1,...,u_nv_n^{\mathrm{T}} Despite being simple, this model achieves good results in many tasks. Tensor product f(u,v) = u \otimes v = \left[ \begin{array}{cccc} u_1v_1 & u_1v_2 & \cdots & u_1v_n \\ u_2v_1 & u_2v_2 & \cdots & u_2v_n \\ \vdots & \vdots & \ddots & \vdots \\ u_nv_1 & u_nv_2 & \cdots & u_nv_n \\ \end{array} \right] This model maintains information at the cost of explosive dimensionality. Circular convolution f(u,v) = u \circledast v = c_1,...,c_n^{\mathrm{T}} where c_i = \sum_j u_j \cdot v_{(i-j)} It tries to fix dimensionality explosion by summing along transdiagonal elements. Nearest neighbour (Kintsch's) model f(u,v) = u+v+\sum_i n_i where \{n_i\} stands for a predefined number of neighbours of the predicate which are also close to the argument.Walter Kintsch. Predication. Cognitive Science, 25(2):173–202, 2001. The behavior of Kintsch's function is hard to characterized because it is influenced by an unknown part of the vocabulary. However some researches have shown its performance to be comparable to that of additive model.Jeff Mitchell and Mirella Lapata. Composition in Distributional Models of Semantics. Cognitive Science, 34(8):1388–1429, 2010. ISSN 1551-6709. doi: 10.1111/j.1551-6709.2010.01106.x. Lexical function models In lexical function models, some words are considered as functions operating on other words. The composition function is simply function application or, when they are expressed in matrices and tensors, multiplication. In this aspect, lexical function models are similar to basic operation models. However lexical function models do have the ability to learn from data. They differ from other models in that the parameters they fit fall out of the composition function into the VSM. In their treatment, some words are meant to compose. Surprisingly, some researches find out that they are still comparable to each other individually.Marco Baroni and Roberto Zamparelli. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1183–1193. Association for Computational Linguistics, 2010. This line of research starts with the composition of adjectives and nouns: f(A,v) = Av where A \in \mathbb{R}^{d \times d} represents an adjectives and v \in \mathbb{R}^d stands for a noun. Since then researchers have developed models for determiners, intransitive verbs, transitive verbs.Edward Grefenstette, Georgiana Dinu, Yao-Zhong Zhang, Mehrnoosh Sadrzadeh, and Marco Baroni. Multi-Step Regression Learning for Compositional Distributional Semantics. CoRR, abs/1301.6, 2013. Although elegant and effective, it faces practicality problem because of the explosive size of tensors (if nouns have one thousand dimensions, adjectives will have one million of them and for transitive verbs, one billion). Practical lexical function Paperno (2014) tries to fix this by describing each word by a tuple of a vector and some matrices where the number of matrices equal to the number of arguments that word can take.Denis Paperno, Nghia The Pham, and Marco Baroni. A practical and linguistically-motivated approach to compositional distributional semantics. In Proceedings of ACL 2014, 2014. Parametrized models Weighted additive f(u, v; \theta_R) = \alpha_R u + \beta_R v Mixing additive and multiplicative models Introduced in Mitchell 2008.Jeff Mitchell and Mirella Lapata. Vector-based Models of Semantic Composition. In ACL, pages 236–244, 2008. f(u, v; \theta_R) = \alpha_R u + \beta_R v + \gamma_R (u \odot v) Dilation f(u, v; \theta_R) = (uu)v + (\lambda_R-1)(uv)u Full additive f(u, v; \theta_R) = A_R u + B_R v where u,v \in \mathbb{R}^n are word vectors provided by a VSM and A_R, B_R \in \mathbb{R}^{n \times n} are matrices to be learned. Neural networks Single layer neural network f(u, v; \theta) = \tanh(W_eu;v+b_e) The net is placed in a recursive autoencoder and trained by minimizing reconstruction error of all nodes in a set of parse trees. The authors used the same set of parameters for all syntactic relations.Richard Socher, Eric H Huang, Jeffrey Pennington, Andrew Y Ng, and Christopher D Manning. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. NIPS, 24:1–9, 2011. Deeper networks h = \tanh(W_e^{(1)}u;v+b_e^{(1)}) f(u,v; \theta) = \tanh(W_e^{(2)}h+b_e^{(2)}) Matrix-vector Socher et al. (2012) incorporated some ideas from lexical functions.Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing (EMNLP), number Mv, 2012. They represented each word by a matrix-vector pair. The composition of two words u=(A,a) and v=(B,b) is carried out in two parts: f_V(u, v; \theta) = g \left(W_V \left[ \begin{matrix} Ba \\ Ab \end{matrix} \right] + b \right) where g is a sigmoidal function, for example sigmoid or tanh. f_M(u, v; \theta) = W_M\left[ \begin{matrix} A \\ B \end{matrix} \right] f(u,v; \theta) = (f_M(u, v; \theta), f_V(u, v; \theta)) All matrices and vectors are not predetermined but computed via training on a parsed corpus. Even for low dimensionality (e.g. 100 dimensions for each vector), the number of parameters grow too large. Therefore they restricted the rank of matrices by setting: A = UV + diag(a) where U \in \mathbb{R}^{n \times r}, V \in \mathbb{R}^{r \times n}, a \in \mathbb{R}^{n} . The rank was fixed to be 3 in all their experiments. References Category:Composition model